09. Getting Stopwords from NLTK

INSTRUCTOR NOTE:

Depending on your setup, downloading the corpus with the GUI (as I do) can be slow and painful. This Stack Overflow page explains how to download it from the command line instead: http://stackoverflow.com/questions/5843817/programmatically-install-nltk-corpora-models-i-e-without-the-gui-downloader
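
As a rough illustration of the non-GUI route, the sketch below wraps nltk.download so that only one corpus is fetched. The helper name download_corpus is made up for this note, and it assumes NLTK is already installed and the machine has network access.

```python
def download_corpus(corpus_id="stopwords", download_dir=None):
    """Download one NLTK corpus without opening the GUI downloader.

    download_corpus is a hypothetical helper name, not part of NLTK.
    Returns whatever nltk.download reports (True on success).
    """
    import nltk  # local import: NLTK is only needed when the function is called

    # quiet=True suppresses the per-package progress messages.
    # download_dir=None lets NLTK pick its default nltk_data location.
    return nltk.download(corpus_id, download_dir=download_dir, quiet=True)
```

Calling download_corpus() once from a Python prompt should be enough for this quiz. The same thing can be done from a terminal with: python -m nltk.downloader stopwords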

Note: Version 3.1 of NLTK has a bug with obtaining and downloading the 'panlex_lite' corpus. While this is scheduled to be fixed in version 3.2, you can follow these steps to install this corpus in the meantime:

  1. Use nltk.download('all', halt_on_error=False) to get all of the corpora except for the 'panlex_lite' corpus.
  2. You should have a folder on your computer called "nltk_data", which holds all of the downloaded files referenced by nltk. (You might find it in your "/Users/<username>/" folder.) Save the archived version of the corpus from this link into the "nltk_data/corpora" folder. Warning: The zip file is 1.7 GB!
  3. Unzip the folder. You should have a file structure that looks like "nltk_data/corpora/panlex_lite/" which contains two files with the unarchived corpus data.
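
One way to check the result of the steps above is a small script like the following. It is only a sketch: the helper name panlex_lite_installed is invented here, and it assumes "nltk_data" sits directly in your home directory, which may not match your setup.

```python
import os


def panlex_lite_installed(nltk_data_dir):
    """Return True if nltk_data/corpora/panlex_lite/ exists and is not empty.

    panlex_lite_installed is a hypothetical helper, not an NLTK function.
    """
    corpus_dir = os.path.join(nltk_data_dir, "corpora", "panlex_lite")
    return os.path.isdir(corpus_dir) and len(os.listdir(corpus_dir)) > 0


# Assumption: nltk_data lives in the home directory, as suggested in step 2.
default_dir = os.path.join(os.path.expanduser("~"), "nltk_data")
print("panlex_lite present:", panlex_lite_installed(default_dir))
```

If this prints False, check whether the zip file was extracted somewhere else or into an extra nested folder.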

An update to the stopwords corpus in March 2016 changed the number of English stopwords: with the most recent corpus data, your answer should be 153.
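
To see which count your own installation gives, something like the following should work. It is a sketch that assumes the 'stopwords' corpus has already been downloaded; the function name count_english_stopwords is made up for this note, and the number it returns depends on the corpus version you have.

```python
def count_english_stopwords():
    """Return the number of English stopwords in the installed NLTK corpus.

    count_english_stopwords is a hypothetical helper name. NLTK raises a
    LookupError if the 'stopwords' corpus has not been downloaded yet.
    """
    from nltk.corpus import stopwords  # local import: needs NLTK installed

    # stopwords.words("english") returns a list of lowercase stopword strings.
    return len(stopwords.words("english"))
```

Running print(count_english_stopwords()) after the download shows the count for your copy of the corpus, which you can compare with the number quoted above.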